Conversation

ArangoGutierrez
Collaborator

No description provided.

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds comprehensive E2E testing for containerd runtime configuration alongside existing Docker testing infrastructure. The changes introduce a nested container testing framework that allows running tests inside containers to validate NVIDIA Container Toolkit behavior in containerized environments.

  • Adds new E2E tests for containerd drop-in configuration functionality
  • Introduces nvidia-cdi-refresh systemd unit testing
  • Implements nested container runner infrastructure for isolated testing

Reviewed Changes

Copilot reviewed 9 out of 32 changed files in this pull request and generated 1 comment.

Summary per file:
  tests/go.mod: Adds new dependencies for UUID generation and test utilities
  tests/e2e/runner.go: Implements nested container runner with Docker installation and CTK setup
  tests/e2e/nvidia-ctk_containerd_test.go: New comprehensive containerd E2E test suite
  tests/e2e/nvidia-ctk_docker_test.go: Refactors to use shared runner infrastructure and fixes macOS compatibility
  tests/e2e/nvidia-cdi-refresh_test.go: New systemd unit tests for CDI refresh functionality
  tests/e2e/nvidia-container-cli_test.go: Refactors to use nested container runner
  tests/e2e/installer.go: Adds containerd installation template and additional flags support
  tests/e2e/e2e_test.go: Centralizes test runner initialization in BeforeSuite
  tests/e2e/Makefile: Documents new test categories

@ArangoGutierrez
Collaborator Author

Builds on #1235

Doesn't include tests for #1311; those should be added as a follow-up.

@coveralls

coveralls commented Sep 23, 2025

Pull Request Test Coverage Report for Build 18005738357

Details

  • 0 of 1 (0.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.006%) to 36.277%

Changes missing coverage:
  pkg/config/engine/containerd/config_drop_in.go: 0 of 1 changed/added lines covered (0.0%)
Totals:
  Change from base Build 17981864462: 0.006%
  Covered Lines: 4827
  Relevant Lines: 13306

💛 - Coveralls

@ArangoGutierrez
Collaborator Author

I'll mark this PR as ready for review once #1235 is merged

@ArangoGutierrez
Collaborator Author

Rebased

@ArangoGutierrez ArangoGutierrez force-pushed the e2e_containerd branch 3 times, most recently from 029af03 to 1899001 on September 25, 2025 11:16
@ArangoGutierrez ArangoGutierrez marked this pull request as ready for review on September 25, 2025 11:26
@elezar elezar marked this pull request as draft on October 13, 2025 12:10
@ArangoGutierrez ArangoGutierrez force-pushed the e2e_containerd branch 2 times, most recently from 51ad031 to c65e468 on October 13, 2025 13:57
@ArangoGutierrez
Collaborator Author

Rebased

@ArangoGutierrez ArangoGutierrez marked this pull request as ready for review on October 13, 2025 14:01
AfterAll(func(ctx context.Context) {
// Cleanup: remove the container and the temporary script on the host.
// Use || true to ensure cleanup doesn't fail the test
runner.Run(fmt.Sprintf("docker rm -f %s 2>/dev/null || true", containerName)) //nolint:errcheck
Member

Instead of the nolint let's just drop the return values.

Suggested change
runner.Run(fmt.Sprintf("docker rm -f %s 2>/dev/null || true", containerName)) //nolint:errcheck
_, _, _ = runner.Run(fmt.Sprintf("docker rm -f %s 2>/dev/null || true", containerName))

Does it make sense to at least WARN if the cleanup fails? The || true doesn't ensure that the test doesn't fail; the fact that we don't check the return value does that.
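A sketch of what that could look like, assuming runner.Run returns (stdout, stderr, error) as it does elsewhere in this suite:

// Sketch only: dropping the trailing "|| true" lets the error reflect a failed cleanup,
// which we log as a warning instead of failing the spec.
if _, _, err := runner.Run(fmt.Sprintf("docker rm -f %s 2>/dev/null", containerName)); err != nil {
	GinkgoWriter.Printf("Warning: failed to remove container %s: %v\n", containerName, err)
}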

Comment on lines 56 to 59
# Remove any imports line from the config (reset to original state)
if [ -f /etc/containerd/config.toml ]; then
grep -v "^imports = " /etc/containerd/config.toml > /tmp/config.toml.tmp && mv /tmp/config.toml.tmp /etc/containerd/config.toml || true
fi
Member

Why not just make a copy of the original config and restore that after / before each test?


# Restart containerd to pick up the clean config
systemctl restart containerd
sleep 2
Member

Is there a way to check containerd health?
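One option (just a sketch, not necessarily the approach to take here): poll systemd and the containerd client instead of sleeping a fixed amount.

// Sketch: wait until containerd answers instead of relying on a fixed sleep.
// Assumes nestedContainerRunner.Run returns (stdout, stderr, error) as elsewhere in this suite.
_, _, err := nestedContainerRunner.Run(`
systemctl restart containerd
for i in $(seq 1 30); do
    systemctl is-active --quiet containerd && ctr version >/dev/null 2>&1 && exit 0
    sleep 1
done
echo "containerd did not become ready" >&2
exit 1
`)
Expect(err).ToNot(HaveOccurred())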

Comment on lines 80 to 84
output, _, err := nestedContainerRunner.Run(`cat /etc/containerd/conf.d/99-nvidia.toml`)
Expect(err).ToNot(HaveOccurred())
Expect(output).To(ContainSubstring(`nvidia`))
Expect(output).To(ContainSubstring(`nvidia-cdi`))
Expect(output).To(ContainSubstring(`nvidia-legacy`))
Member

As I mentioned in person, we are no longer triggering the configuration of containerd with the current installation mechanism.

output, _, err = nestedContainerRunner.Run(`containerd config dump`)
Expect(err).ToNot(HaveOccurred())
// Verify imports section is in the merged config
Expect(output).To(ContainSubstring(`imports = ['/etc/containerd/conf.d/*.toml']`))
Member

I actually think that config dump prints the ACTUAL paths of all files processed.

Comment on lines +97 to +137
ContainSubstring(`default_runtime_name = "nvidia"`),
ContainSubstring(`default_runtime_name = 'nvidia'`),
Member

These should definitely be VERSION-specific checks.

Member

Tests that pass in multiple states are prone to be flaky. Please pick the format for the version of containerd that we have installed by default and use that. This makes the differences between SPECIFIC containerd versions more obvious when reading the tests.
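For example, the expectation could be keyed off the containerd version under test instead of accepting either form (a sketch; containerdMajorVersion is a hypothetical value set during suite setup, and the single-quoted form for 2.x is taken from the alternatives already in this test and should be confirmed):

// Hypothetical sketch: pick exactly one expected format per containerd version.
expected := `default_runtime_name = "nvidia"` // containerd 1.x style
if containerdMajorVersion >= 2 {
	expected = `default_runtime_name = 'nvidia'` // assumed containerd 2.x config dump style
}
Expect(output).To(ContainSubstring(expected))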

ContainSubstring(`default_runtime_name = "nvidia"`),
ContainSubstring(`default_runtime_name = 'nvidia'`),
))
Expect(output).To(ContainSubstring(`enable_cdi = true`))
Member

Where do we toggle this behaviour? It is disabled by default.

Comment on lines +105 to +145
ContainSubstring(`[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]`),
ContainSubstring(`[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.nvidia]`),
Member

Once again, this should be version-specific.

})

When("containerd already has a custom default runtime configured", func() {
It("should preserve the existing default runtime when --set-as-default=false is specified", func(ctx context.Context) {
Member

--set-as-default=false is not specified. It is the default.

Member

Note that we don't strictly speaking need a custom runtime to test the behaviour of NOT overriding the default.

`)
Expect(err).ToNot(HaveOccurred())

// Configure containerd with drop-in config (explicitly not setting as default)
Member

How are we configuring with the drop-in config in this case?

})

When("containerd has multiple custom runtimes and plugins configured", func() {
It("should add NVIDIA runtime alongside existing runtimes like kata", func(ctx context.Context) {
Member

As a matter of interest, how is this different from an arbitrary "custom" runtime?

Expect(err).ToNot(HaveOccurred())

// Verify kata runtime was added
output, _, err := nestedContainerRunner.Run(`systemctl restart containerd && sleep 2 && containerd config dump | grep -A5 kata`)
Member

Why the -A5? Also, please split the different steps.

Expect(err).ToNot(HaveOccurred())
Expect(output).To(ContainSubstring(`kata`))

// Configure containerd with drop-in config and set NVIDIA as default
Member

Where are we setting these options?

Comment on lines 246 to 252
_, _, err := nestedContainerRunner.Run(`
rm -f /etc/containerd/config.toml
rm -rf /etc/containerd/conf.d
mkdir -p /etc/containerd/conf.d

cat > /etc/containerd/config.toml <<'EOF'
version = 3
Member

How do we ensure that this is only run on containerd versions that support it?


# Create a custom config that will be imported
# Use the correct plugin path for containerd v2/v3
cat > /etc/containerd/custom.d/10-custom.toml <<'EOF'
Member

I thought the correct path was: /etc/containerd/conf.d? https://github.com/containerd/containerd/pull/12323/files

Expect(err).ToNot(HaveOccurred())

// Verify containerd can load the custom import before installer
_, _, err = nestedContainerRunner.Run(`systemctl restart containerd && sleep 2 && containerd config dump | grep -i myregistry`)
Member

Why are we randomly checking for myregistry? Is this sufficient?

// Run a container with NVIDIA runtime
// Note: We use native snapshotter because overlay doesn't work in nested containers
output, stderr, err = nestedContainerRunner.Run(`
ctr run --rm --snapshotter native docker.io/library/busybox:latest test-nvidia echo "Hello from NVIDIA runtime"
Member

How does this use the nvidia runtime?

Member

Also, if no devices are requested, then the nvidia-runtime is a no-op.


// Pull a test image
output, stderr, err := nestedContainerRunner.Run(`
ctr image pull docker.io/library/busybox:latest
Member

Does ctr automatically use the containerd config?

@Copilot Copilot AI left a comment

Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

When("configuring containerd on a Kubernetes node", func() {
It("should add NVIDIA runtime using drop-in config without modifying the main config", func(ctx context.Context) {
// Configure containerd using nvidia-ctk
cmd := fmt.Sprintf(nvidiaCtkConfigureContainerdCmd, "--nvidia-set-as-default --cdi.enabled")
Copilot AI Oct 15, 2025

The flag '--nvidia-set-as-default' is inconsistent with the flag used elsewhere ('--set-as-default=false') and is likely invalid in nvidia-ctk. Replace with '--set-as-default' to ensure the command executes correctly.

Suggested change
cmd := fmt.Sprintf(nvidiaCtkConfigureContainerdCmd, "--nvidia-set-as-default --cdi.enabled")
cmd := fmt.Sprintf(nvidiaCtkConfigureContainerdCmd, "--set-as-default --cdi.enabled")


Expect(output).To(ContainSubstring(`kata`))

// Configure containerd using nvidia-ctk with nvidia as default
cmd := fmt.Sprintf(nvidiaCtkConfigureContainerdCmd, "--nvidia-set-as-default --cdi.enabled")
Copilot AI Oct 15, 2025

Same issue as above: '--nvidia-set-as-default' should be '--set-as-default' to match nvidia-ctk CLI flags.

Suggested change
cmd := fmt.Sprintf(nvidiaCtkConfigureContainerdCmd, "--nvidia-set-as-default --cdi.enabled")
cmd := fmt.Sprintf(nvidiaCtkConfigureContainerdCmd, "--set-as-default --cdi.enabled")


Member

Prefer --set-as-default over --nvidia-set-as-default.

// Verify NVIDIA runtime was added
Expect(output).To(SatisfyAny(
ContainSubstring(`[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]`),
ContainSubstring(`[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.nvidia]`),
Copilot AI Oct 15, 2025

The second ContainSubstring uses single quotes around the plugin key; containerd config dump uses double quotes. Update to [plugins."io.containerd.cri.v1.runtime".containerd.runtimes.nvidia] to avoid false negatives when matching the v3 config structure.

Suggested change
ContainSubstring(`[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.nvidia]`),
ContainSubstring(`[plugins."io.containerd.cri.v1.runtime".containerd.runtimes.nvidia]`),


Expect(err).ToNot(HaveOccurred(), "Failed to install toolkit for containerd")
})

AfterAll(func(ctx context.Context) {
Copilot AI Oct 15, 2025

If BeforeAll fails before initializing nestedContainerRunner, this AfterAll will panic by calling Run on a nil interface. Add a nil check guard (e.g., if nestedContainerRunner == nil { best-effort docker rm and return }) before using it.

Suggested change
AfterAll(func(ctx context.Context) {
AfterAll(func(ctx context.Context) {
// If nestedContainerRunner is nil, BeforeAll failed before initialization.
// Best-effort cleanup: remove the container and return.
if nestedContainerRunner == nil {
_, _, err := runner.Run(fmt.Sprintf("docker rm -f %s 2>/dev/null || true", containerName))
if err != nil {
GinkgoWriter.Printf("Warning: failed to cleanup container %s: %v\n", containerName, err)
}
return
}


Comment on lines 147 to 164

# Restart containerd and wait for it to be ready
systemctl restart containerd

# Wait for containerd to be responsive (up to 30 seconds)
for i in $(seq 1 30); do
if ctr version >/dev/null 2>&1; then
echo "containerd is ready"
break
fi
if [ "$i" -eq 30 ]; then
echo "ERROR: containerd failed to become ready"
exit 1
fi
sleep 1
done
`)
Expect(err).ToNot(HaveOccurred(), "Failed to reset containerd configuration")
Copilot AI Oct 15, 2025

The restart-and-wait logic is duplicated here and again at lines 289–298. Consider extracting this into a small helper (e.g., restartContainerdAndWait(r Runner)) to reduce duplication and keep tests concise.

Suggested change
# Restart containerd and wait for it to be ready
systemctl restart containerd
# Wait for containerd to be responsive (up to 30 seconds)
for i in $(seq 1 30); do
if ctr version >/dev/null 2>&1; then
echo "containerd is ready"
break
fi
if [ "$i" -eq 30 ]; then
echo "ERROR: containerd failed to become ready"
exit 1
fi
sleep 1
done
`)
Expect(err).ToNot(HaveOccurred(), "Failed to reset containerd configuration")
`)
Expect(err).ToNot(HaveOccurred(), "Failed to reset containerd configuration")
// Restart containerd and wait for it to be ready
err = restartContainerdAndWait(nestedContainerRunner)
Expect(err).ToNot(HaveOccurred(), "Failed to restart and wait for containerd")


nestedContainerCacheDir = strings.TrimSpace(tmpdir)

// Prepare cache on the remote runner (where Docker is)
installer, err := NewToolkitInstaller(
Member

Is there a reason that we can't use the "global" toolkitInstaller?

Collaborator Author

Hadn't thought of that; now simplified.


# Backup custom.d directory if it exists
if [ -d /etc/containerd/custom.d ]; then
cp -r /etc/containerd/custom.d /tmp/containerd-custom.d.backup
Member

The containerd source code does not seem to include custom.d. Why are we copying this?

Collaborator Author

I was creating one during development; not needed any more.


# Backup config.d directory if it exists
if [ -d /etc/containerd/config.d ]; then
cp -r /etc/containerd/config.d /tmp/containerd-config.d.backup
Member

The path that is searched by default is /etc/containerd/conf.d. See containerd/containerd#12323

Collaborator Author

Done

Comment on lines 70 to 74
# Backup the original config.toml
if [ -f /etc/containerd/config.toml ]; then
cp /etc/containerd/config.toml /tmp/containerd-config.toml.backup
fi

Member

Just as a thought: does using -f -r as arguments to cp for all of the things we want to back up make this (and the subsequent cleanup) simpler?
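For instance, a single recursive, forced copy could back up everything under /etc/containerd in one step, with the restore being its mirror image (sketch only; the backup path is illustrative):

// Sketch: back up the whole directory once instead of per-file/per-directory checks.
_, _, err := nestedContainerRunner.Run(`
rm -rf /tmp/containerd.backup
cp -rf /etc/containerd /tmp/containerd.backup
`)
Expect(err).ToNot(HaveOccurred())

// ...and restore it afterwards:
_, _, err = nestedContainerRunner.Run(`
rm -rf /etc/containerd
cp -rf /tmp/containerd.backup /etc/containerd
`)
Expect(err).ToNot(HaveOccurred())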

systemctl restart containerd || true
`)
if err != nil {
GinkgoWriter.Printf("Warning: failed to restore containerd configuration: %v\n", err)
Member

Does GinkgoLogr not have an actual Warning method?

Member

Otherwise just use GinkgoLogr.Error here?
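For reference, GinkgoLogr is a logr.Logger, which only exposes Info (with verbosity levels) and Error, so a warning would look roughly like one of the following:

// logr has no Warning method; either log at Info with a marker, or use Error.
GinkgoLogr.Info("warning: failed to restore containerd configuration", "error", err)
// or
GinkgoLogr.Error(err, "failed to restore containerd configuration")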

BeforeEach(func(ctx context.Context) {
// Restore from backup to ensure clean state for each test
_, _, err := nestedContainerRunner.Run(`
# Restore the original config.toml
Member

It seems strange to create the backup in BeforeAll and then restore it directly in BeforeEach. Why not restore the backup in AfterEach instead? This would mean that we don't need the logic in AfterAll.

Collaborator Author

You are right, refactored

Comment on lines 148 to 162
# Restart containerd and wait for it to be ready
systemctl restart containerd

# Wait for containerd to be responsive (up to 30 seconds)
for i in $(seq 1 30); do
if ctr version >/dev/null 2>&1; then
echo "containerd is ready"
break
fi
if [ "$i" -eq 30 ]; then
echo "ERROR: containerd failed to become ready"
exit 1
fi
sleep 1
done
Member

This should be logically separate from restoring the backups.

Expect(err).ToNot(HaveOccurred(), "Failed to reset containerd configuration")
})

When("configuring containerd on a Kubernetes node", func() {
Member

nit: We're not on a Kubernetes node.

Collaborator Author

edited

When("configuring containerd on a Kubernetes node", func() {
It("should add NVIDIA runtime using drop-in config without modifying the main config", func(ctx context.Context) {
// Configure containerd using nvidia-ctk
cmd := fmt.Sprintf(nvidiaCtkConfigureContainerdCmd, "--nvidia-set-as-default --cdi.enabled")
Member

Let's definitely just use string concatenation here. I would even go so far as to say that we should just duplicate the command every time we want to run it.

Comment on lines 206 to 207
rm -rf /etc/containerd/config.d
mkdir -p /etc/containerd/config.d
Member

Why are we removing and creating this folder? Does this not contradict us restoring the backups in BeforeEach?

Collaborator Author

Agreed, adopted

Comment on lines +223 to +170
SystemdCgroup = true
CustomOption = "custom-value"
Member

Are these values relevant? Should we check that the applied config includes these settings for the nvidia runtime?

Comment on lines 260 to 261
// NVIDIA runtime should be added
Expect(output).To(ContainSubstring(`nvidia`))
Member

This is a very BROAD check.

Comment on lines 258 to 259
// Custom runtime should be preserved with all its options
Expect(output).To(MatchRegexp(`(?s)custom-runtime.*CustomOption.*custom-value`))
Member

That's not what this checks. For example, the SystemdCgroup setting is not checked.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
@Copilot Copilot AI left a comment

Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

ContainSubstring(`[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]`),
ContainSubstring(`[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.nvidia]`),
))
Expect(output).To(ContainSubstring(`BinaryName = `)) // The NVIDIA container runtime binary
Copilot AI Oct 17, 2025

This assertion is too weak and may produce false positives since any runtime's BinaryName line would satisfy it. Strengthen it to assert the NVIDIA runtime binary specifically, e.g., use a regex similar to other tests: Expect(output).To(MatchRegexp(`(?s)runtimes\.nvidia\].*options\].*BinaryName.*nvidia-container-runtime`)).

Suggested change
Expect(output).To(ContainSubstring(`BinaryName = `)) // The NVIDIA container runtime binary
Expect(output).To(MatchRegexp(`(?s)runtimes\.nvidia\].*options\].*BinaryName.*nvidia-container-runtime`)) // The NVIDIA container runtime binary


Comment on lines +92 to +104
AfterEach(func(ctx context.Context) {
// Step 1: Restore containerd configuration from backup
_, _, err := nestedContainerRunner.Run(`
# Restore the original conf.d
if [ -d /tmp/containerd-conf.d.backup ]; then
rm -rf /etc/containerd/conf.d
cp -r /tmp/containerd-conf.d.backup /etc/containerd/conf.d
else
# If no backup exists, just clean up
rm -rf /etc/containerd/conf.d
mkdir -p /etc/containerd/conf.d
fi
`)
Copilot AI Oct 17, 2025

Only conf.d is restored; /etc/containerd/config.toml is left mutated across tests while the suite uses Ordered and ContinueOnFailure. A failure can leave broken state for subsequent tests. Consider also backing up and restoring /etc/containerd/config.toml (e.g., copy to /tmp/containerd-config.toml.backup in BeforeAll and restore here), or remove ContinueOnFailure to avoid cascading failures.



When("containerd has multiple custom runtimes and plugins configured", func() {
It("should add NVIDIA runtime alongside existing runtimes like kata", func(ctx context.Context) {
// Add some custom settings to the existing kindest/base config
Copilot AI Oct 17, 2025

This comment is misleading because tests explicitly share state (Ordered) and earlier specs overwrite /etc/containerd/config.toml. Either ensure a fresh baseline before appending (e.g., regenerate the default config) or update the comment to reflect that you are appending to the current config state from prior tests.

Suggested change
// Add some custom settings to the existing kindest/base config
// Add some custom settings to the current config state (which may have been modified by previous tests)


Comment on lines +143 to +146
Expect(output).To(SatisfyAny(
ContainSubstring(`[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]`),
ContainSubstring(`[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.nvidia]`),
))
Copilot AI Oct 17, 2025

This duplicates the earlier runtime-section presence check in the same test (lines 127–130). Consider removing one of these blocks to reduce redundancy.

Suggested change
Expect(output).To(SatisfyAny(
ContainSubstring(`[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]`),
ContainSubstring(`[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.nvidia]`),
))


@elezar elezar left a comment
Member

There are still some unanswered questions below.

Also, the intent of this PR is to test how the merged configs are handled by different containerd versions. I don't feel as if this really achieves that, since we're only ever testing the version of containerd that is included in the kind node image.

Can one parameterize the test for 1.7.x and 2.0 so that we can validate the behavior there? (I'm happy to do this as a follow-up, but it would be good to understand how we would handle the expected differences).
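One possible shape for that (a sketch using Ginkgo's DescribeTable; installContainerdVersion is a hypothetical helper, and the version strings and expected section formats are illustrative and would need to be confirmed per release):

// Hypothetical sketch: run the same drop-in configuration checks against several containerd releases.
DescribeTable("configuring containerd via the drop-in config",
	func(containerdVersion, expectedRuntimeSection string) {
		// installContainerdVersion is an assumed helper that installs the requested
		// containerd release inside the nested container.
		Expect(installContainerdVersion(nestedContainerRunner, containerdVersion)).To(Succeed())

		cmd := fmt.Sprintf(nvidiaCtkConfigureContainerdCmd, "--set-as-default --cdi.enabled")
		_, _, err := nestedContainerRunner.Run(cmd)
		Expect(err).ToNot(HaveOccurred())

		output, _, err := nestedContainerRunner.Run(`systemctl restart containerd && containerd config dump`)
		Expect(err).ToNot(HaveOccurred())
		Expect(output).To(ContainSubstring(expectedRuntimeSection))
	},
	Entry("containerd 1.7.x", "1.7", `[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]`),
	Entry("containerd 2.0.x", "2.0", `[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.nvidia]`),
)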

Comment on lines +68 to +71
# Backup the original conf.d directory
if [ -d /etc/containerd/conf.d ]; then
cp -r /etc/containerd/conf.d /tmp/containerd-conf.d.backup
fi
Member

What about /etc/containerd/config.toml (if it exists)?

Expect(err).ToNot(HaveOccurred(), "Failed to restart containerd after configuration restore")
})

When("configuring containerd on a KIND node", func() {
Member

nit: Why reference KIND at all? We only selected the kind nodes since this was the simplest way to get containerd running in a container.

Suggested change
When("configuring containerd on a KIND node", func() {
When("configuring containerd", func() {

Comment on lines +128 to +129
ContainSubstring(`[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]`),
ContainSubstring(`[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.nvidia]`),
Member

As called out a number of times, this check is too broad. We should capture the differences between different containerd versions explicitly.

})

When("configuring containerd on a KIND node", func() {
It("should add NVIDIA runtime using drop-in config without modifying the main config", func(ctx context.Context) {
Member

This is only true for containerd versions that INCLUDE the default /etc/containerd/conf.d imports.

))

// Verify the drop-in config was processed (config dump shows actual imported files)
Expect(output).To(ContainSubstring(`/etc/containerd/conf.d/*.toml`))
Member

Can we please load the output config as TOML and check specific values instead of checking substrings?
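A sketch of what that could look like, assuming a TOML parser such as github.com/pelletier/go-toml/v2 (imported as toml) were added to tests/go.mod; the key path shown is the containerd 1.x CRI one and would differ for 2.x:

// Sketch: unmarshal `containerd config dump` and assert concrete keys instead of substrings.
dump, _, err := nestedContainerRunner.Run(`containerd config dump`)
Expect(err).ToNot(HaveOccurred())

var cfg map[string]interface{}
Expect(toml.Unmarshal([]byte(dump), &cfg)).To(Succeed())

plugins := cfg["plugins"].(map[string]interface{})
cri := plugins["io.containerd.grpc.v1.cri"].(map[string]interface{})
criContainerd := cri["containerd"].(map[string]interface{})
Expect(criContainerd["default_runtime_name"]).To(Equal("nvidia"))
Expect(criContainerd["runtimes"]).To(HaveKey("nvidia"))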

Comment on lines +140 to +141
// Verify KIND settings are still present
Expect(output).To(ContainSubstring(`SystemdCgroup = true`))
Member

How is this a Kind setting? Does the kind node ALWAYS use systemd cgroups?
